Materials and setup

NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)

You should have R installed –if not:

Download workshop materials:

What is R?

R is a programming language designed for statistical computing. Notable characteristics include:

  • Vast capabilities, wide range of statistical and graphical techniques
  • Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
  • Written primarily by statisticians
  • FREE (no cost, open source)
  • Excellent community support: mailing list, blogs, tutorials
  • Easy to extend by writing new functions

InspiRation

OK, it’s free and popular, but what makes R worth learning? In a word, “packages”. If you have a data manipulation, analysis or visualization task, chances are good that there is an R package for that. Let’s install some packages and look at some examples.

Where are we?

library(ggmap)
nwbuilding <- geocode("1737 Cambridge Street Cambridge, MA 02138", source = "google") 
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=1737%20Cambridge%20Street%20Cambridge,%20MA%2002138&sensor=false
ggmap(get_map("Cambridge, MA", zoom = 15)) +
  geom_point(data=nwbuilding, size = 7, shape = 13, color = "red")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Cambridge,+MA&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Cambridge,%20MA&sensor=false

What will world population be in 2020?

library(forecast)
library(plotly)

## from https://esa.un.org/unpd/wpp/Download/Standard/Population/
worldpop <- structure(c(2.525149312, 2.571867515, 2.617940399, 2.66402901,
2.710677773, 2.758314525, 2.807246148, 2.85766291, 2.909651396,
2.963216053, 3.018343828, 3.075073173, 3.133554362, 3.194075347,
3.256988501, 3.322495121, 3.390685523, 3.461343172, 3.533966901,
3.607865513, 3.682487691, 3.757734668, 3.833594894, 3.90972212,
3.985733775, 4.061399228, 4.13654207, 4.211322427, 4.286282447,
4.362189531, 4.439632465, 4.518602042, 4.599003374, 4.681210508,
4.765657562, 4.852540569, 4.942056118, 5.033804944, 5.126632694,
5.218978019, 5.309667699, 5.398328753, 5.485115276, 5.57004538,
5.653315893, 5.735123084, 5.815392305, 5.894155105, 5.971882825,
6.049205203, 6.126622121, 6.204310739, 6.282301767, 6.360764684,
6.439842408, 6.51963585, 6.600220247, 6.68160732, 6.763732879,
6.846479521, 6.92972504300001, 7.013427052, 7.097500453, 7.181715139,
7.265785946, 7.349472099), .Tsp = c(1950, 2015, 1), class = "ts")

## Projected numbers (in billions) of humans living on earth
fit <- auto.arima(worldpop)
ggplotly(autoplot(forecast(fit)))

Want to interactively explore the shape of the Churyumov–Gerasimenko comet?

comet <- rgl::readOBJ(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
plot_ly(x = comet$vb[1,],
        y = comet$vb[2,],
        z = comet$vb[3,],
        i = comet$it[1,]-1,
        j= comet$it[2,]-1,
        k = comet$it[3,]-1,
        type = "mesh3d")

Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.

Graphical User Interfaces (GUIs)

R GUI alternatives

The old-school way is to run R directly in a terminal

But hardly anybody does it that way anymore! The Windows version of R comes with a GUI that looks like this:

The default Windows GUI is not very good:

  • No parentheses matching or syntax highlighting
  • No work-space browser

RStudio (an alternative GUI for R) is shown below.

RStudio has many useful features, including parentheses matching and auto-completion. RStudio is not the only advanced R interface; other alternatives include Emacs with ESS (shown below).

Emacs + ESS is a very powerful combination, but can be difficult to set up.

Jupyter is a notebook interface that runs in your web browser. A lot of people like it. You can access these workshop notes as a Jupyter notebook at http://tutorials-live.iq.harvard.edu:8000/notebooks/workshops/R/Rintro/Rintro.ipynb

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).

  • Open the RStudio program
  • Open up today’s R script
    • In RStudio, Go to File => Open Script
    • Locate and open the Rintro.R script in the Rintro folder on your desktop
  • Go to Tools => Set working directory => To source file location (more on the working directory later)
  • I encourage you to add your own notes to this file! Every line that starts with # is a comment that will be ignored by R. My comments all start with ##; you can add your own, possibly using # or ### to distinguish your comments from mine.

Now that we know what we’re getting into and have our environment set up, let’s get to work.

Exercise 0

The purpose of this exercise is mostly to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to learn. If you don’t know how to do something you can use internet search engines, search on StackOverflow, or ask the person next to you.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!

  1. Try to get R to add 2 plus 2.
  2. Try to calculate the square root of 10.
  3. There is an R package named car. Try to install this package.
  4. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.
  5. Open a new web browser or tab, go to http://cran.r-project.org/web/views/ and skim the topic closest to your field/interests.

Exercise 0 solution

  1. Add 2 plus 2.
2 + 2
## [1] 4
sum(2, 2)
## [1] 4
  2. Calculate the square root of 10:
sqrt(10)
## [1] 3.162278
10^(1/2)
## [1] 3.162278
  3. Install the “car” package:

In RStudio, go to the “Packages” tab and click the “Install” button. Search in the pop-up window and click “Install”.

Alternatively, use the install.packages function like this:

install.packages("car")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
  4. Find “An Introduction to R”.

Go to the main help page by running help.start() or using the GUI menu, then find and click on the link to “An Introduction to R”.

  5. Go to http://cran.r-project.org/web/views/ and skim the topic closest to your field/interests.

I like the machine learning topic.

Example project overview: baby names!

I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.

The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is in dataSets/babyNames.csv.

Our first goal is to read these data into R. In order to do that we need to learn how to call functions, install packages, set our working directory, read a .csv file, and assign the result to a name. Let’s get to it.

R installing and using packages

There are thousands of R packages that extend R’s capabilities. Some packages are distributed with R, and some of these are attached to the search path by default. Many more are available in package repositories.

In order to make reading and analyzing our baby names data easier we will install and use a collection of packages called tidyverse. tidyverse is a meta package that loads the dplyr package for easier data manipulation, the readr package for easier data import/export, and several other useful packages.

Packages can be installed using the install.packages function.

Functions

The general form for calling R functions is

## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)

Arguments can be matched by position or name. Let’s see how that works, using the install.packages function.
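Before turning to install.packages, here is a minimal sketch of the two matching styles using base R's seq function. seq is used here only because it runs without installing anything; it is not part of the workshop script.

```r
## Three equivalent calls to seq(), whose first arguments are from, to, and by.
by.position <- seq(1, 10, 2)                   # matched by position
by.name     <- seq(from = 1, to = 10, by = 2)  # matched by name
reordered   <- seq(by = 2, to = 10, from = 1)  # named arguments can come in any order

by.position
## [1] 1 3 5 7 9
identical(by.position, by.name)
## [1] TRUE
identical(by.position, reordered)
## [1] TRUE
```

Naming arguments is more typing, but it makes a call easier to read and protects you from getting the order wrong.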

Installing and using R packages

Since this is the first time we are using the install.packages function we will start by looking up its help page. This is almost always the first thing you should do when using a function for the first time. You can look up the help page for a function like this:

?install.packages

As we can see from the documentation, the first (and only required) argument is named pkgs. Additional arguments specify where this package should be installed from (repos) and to (lib) among other things.

OK, let’s install the “car” package from the repo at “https://cran.rstudio.com”.

install.packages("car", repos = "https://cran.rstudio.com")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)

Installing a package puts a copy of the package on your local computer, but does not make it available for use. To use an installed package you must attach it using the library function.

library("car")
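The difference between an installed package and an attached one can be checked from within R. The sketch below uses the stats package as a stand-in, on the assumption that it is available, since it ships with R and is attached by default. The same calls should work for car or any other package name.

```r
## "stats" ships with R and is attached by default; it stands in for any package here.
installed <- requireNamespace("stats", quietly = TRUE)  # TRUE if the package is installed
attached  <- "package:stats" %in% search()              # TRUE if it is on the search path

installed
attached

## Functions from an installed package can be called without attaching it,
## by prefixing the function name with the package name and ::
stats::median(c(1, 5, 10))
## [1] 5
```

The package::function form is handy when you need just one function from a package, or when two attached packages define functions with the same name.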

Asking R for help

Now that we’ve installed the car package, how do we use it? We’ve already seen that we can look up the help page using ?. This is actually a shortcut to the help function:

help(help)

The help function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the car package by reading its documentation like this:

help(package = "car")
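Help pages are not the only way to find your way around. As a rough sketch, the base R functions below list a function's argument names and search for functions by name; the search term "install" is just an example.

```r
## Show the argument names of install.packages without opening its help page.
arg.names <- names(formals(utils::install.packages))
head(arg.names, 3)
## [1] "pkgs"  "lib"   "repos"

## List functions on the search path whose names contain "install".
matches <- apropos("install")
matches

## Search the installed documentation by keyword (the same as typing ??csv).
## This opens a results page, so it is commented out here.
# help.search("csv")
```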

Exercise 1

The purpose of this exercise is to practice using the package management and help facilities.

  1. Install the tidyverse package.
  2. Use the library function to attach the tidyverse package.
  3. Look up the help page for the readr package (readr is attached by the tidyverse package). Which function would you use to read a comma separated values (.csv) file?

Exercise 1 solution

## 1. install the tidyverse package
install.packages("tidyverse")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
## 2. attach the tidyverse package
library("tidyverse")
## 3. look up the readr package documentation
help(package = "readr")
## I would use read_csv to read a comma separated values (.csv) file.

Now that we have installed and attached the tidyverse (and readr) packages, and know which function to use to read our data (read_csv), we are almost ready to read in the baby names data. Before we do that let’s take a short excursion to learn about assignment and basic data types in R.

Data types and assignment

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
x <- 10 # Assign the value 10 to a variable named x
x + 1 # Add 1 to x
## [1] 11
x # note that x is unchanged
## [1] 10
y <- x + 1 # Assign y the value x + 1
y
## [1] 11
x <- x + 100 # change the value of x
y ## note that y is unchanged.
## [1] 11

Data types and conversion

The x and y data objects we created are numeric vectors of length one. Vectors are the simplest data structure in R, and are the building blocks used to make more complex data structures. Here are some more vector examples.

x <- c(10, 11, 12)
X <- c("10", "11", "12")
y <- c("h", "e", "l", "l", "o")
Y <- "hello"
z <- c(TRUE, FALSE, TRUE, TRUE)

Notice that the c function combines its arguments into a vector.
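One consequence worth knowing: a vector holds values of a single type, so c() converts mixed inputs to a common type. A small sketch with made-up values (typeof reports how the values are stored):

```r
## c() returns a vector of one type, so mixed inputs are coerced to a common type.
num.and.chr <- c(1, "a")    # the number becomes a character string
num.and.lgl <- c(1, TRUE)   # TRUE becomes the number 1

typeof(num.and.chr)
## [1] "character"
typeof(num.and.lgl)
## [1] "double"
num.and.lgl
## [1] 1 1
```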

All R objects have a type (reported by typeof, or more coarsely by mode) and a length. Since it is impossible for an object not to have these attributes they are called intrinsic attributes.

print(x)
## [1] 10 11 12
typeof(x)
## [1] "double"
length(x)
## [1] 3
print(X)
## [1] "10" "11" "12"
typeof(X)
## [1] "character"
length(X)
## [1] 3
print(y)
## [1] "h" "e" "l" "l" "o"
length(y)
## [1] 5
print(Y)
## [1] "hello"
length(Y)
## [1] 1
print(z)
## [1]  TRUE FALSE  TRUE  TRUE
typeof(z)
## [1] "logical"

Data structures in R can be converted from one type to another using one of the many functions beginning with as.. For example:

print(x)
## [1] 10 11 12
mode(x)
## [1] "numeric"
mode(as.character(x))
## [1] "character"
print(X)
## [1] "10" "11" "12"
mode(X)
## [1] "character"
mode(as.numeric(X))
## [1] "numeric"
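Conversion does not always succeed. The sketch below, using made-up values, shows what happens when a value cannot be converted and how doubles are turned into integers.

```r
## Conversion keeps the length but gives NA for values that cannot be converted.
mixed <- c("10", "eleven", "12")
converted <- suppressWarnings(as.numeric(mixed))  # R would warn: NAs introduced by coercion
converted
## [1] 10 NA 12
is.na(converted)
## [1] FALSE  TRUE FALSE

## Converting doubles to integers truncates toward zero rather than rounding.
as.integer(c(3.9, -3.9))
## [1]  3 -3
```

In real work it is usually better to leave the warning visible; it is suppressed here only to keep the output short.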

Now that we know how to do assignment using <- and how to understand basic data types in R we are finally ready to read in the baby names data.

Getting data into R

The “working directory” and listing files

R knows the directory it was started in, and refers to this as the “working directory”. Since our workshop examples are in the Rintro folder, we should all take a moment to set that as our working directory.

getwd() # what is my current working directory?
# setwd("~/Desktop/Rintro") # change directory

Note that “~” means “my home directory” but that this can mean different things on different operating systems. You can also use the Files tab in RStudio to navigate to a directory, then click “More -> Set as working directory”.

We can also set the working directory using paths relative to the current working directory. Once we are in the “Rintro” folder we can navigate to the “dataSets” folder like this:

getwd() # get the current working directory
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"
setwd("dataSets") # set wd to the dataSets folder
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro/dataSets"
setwd("..") # set wd to enclosing folder ("up")
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"

It can be convenient to list files in a directory without leaving R:

list.files("dataSets") # list files in the dataSets folder
## [1] "babyNames.csv"
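Relative paths only work if the working directory is what you think it is, so it can help to check for a file before reading it. A minimal sketch using base R, assuming the workshop file layout described above:

```r
## Build a path from its parts; file.path() inserts "/" between them.
csv.path <- file.path("dataSets", "babyNames.csv")
csv.path
## [1] "dataSets/babyNames.csv"

## Check that the file can be found from the current working directory
## before trying to read it.
if (file.exists(csv.path)) {
  message("Found ", csv.path)
} else {
  message("Cannot find ", csv.path, " from ", getwd())
}
```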

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.

Data type                  Function        Package
comma separated (.csv)     read_csv()      readr (tidyverse)
other delimited formats    read_delim()    readr (tidyverse)
R (.Rds)                   read_rds()      readr (tidyverse)
Stata (.dta)               read_stata()    haven (tidyverse, needs to be attached separately)
SPSS (.sav)                read_spss()     haven (tidyverse, needs to be attached separately)
SAS (.sas7bdat)            read_sas()      haven (tidyverse, needs to be attached separately)
Excel (.xls, .xlsx)        read_excel()    readxl (tidyverse, needs to be attached separately)

Exercise 2

The purpose of this exercise is to practice reading data into R. The data in “dataSets/babyNames.csv” is moderately tricky to read, making it a good data set to practice on.

  1. Open the help page for the read_csv function. How can you limit the number of rows to be read in?
  2. Read just the first 10 rows of “dataSets/babyNames.csv”. Notice that the “Sex” column has been read as a logical (TRUE/FALSE).
  3. Read the read_csv help page to figure out how to make it read the “Sex” column as a character. Make adjustments to your code until you have read in the first 10 rows with the correct column types. “Year” and “Name.length” should be integer (int), “Count” and “Percent” should be double (dbl) and everything else should be character (chr).
  4. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name baby.names.

Exercise 2 solution

## read ?read_csv
## limit rows with n_max argument
read_csv("dataSets/babyNames.csv", n_max = 10)
## Parsed with column specification:
## cols(
##   Location = col_character(),
##   Year = col_integer(),
##   Sex = col_logical(),
##   Name = col_character(),
##   Count = col_double(),
##   Percent = col_double(),
##   Name.length = col_integer()
## )
## specify column types in the col_types argument
read_csv("dataSets/babyNames.csv", n_max = 10, col_types = "??c????")

## read all the data
baby.names <- read_csv("dataSets/babyNames.csv", col_types = "??c????")
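The compact string "??c????" uses one character per column, which gets hard to read for wide files. As an alternative sketch, readr's cols() lets you name only the columns you want to override. The three-column file below is made up and written to a temporary location so the example runs anywhere readr is installed.

```r
library(readr)

## A made-up three-column file, written to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Count", "sophie,F,7087", "chloe,F,6824"), tmp)

## Compact form: one character per column ("?" = guess, "c" = character).
compact <- read_csv(tmp, col_types = "?c?")

## Spelled-out form: name only the columns you want to override.
by.name <- read_csv(tmp, col_types = cols(Sex = col_character()))

class(compact$Sex)
## [1] "character"
class(by.name$Sex)
## [1] "character"
```

The spelled-out form keeps working if columns are added or reordered, which the positional string does not.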

Checking imported data

It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame.

## we know that this object will have mode and length, because all R objects do.
mode(baby.names)
## [1] "list"
length(baby.names) # number of columns
## [1] 7
## additional information about this data object
class(baby.names) # check to see that baby.names is a data.frame
## [1] "tbl_df"     "tbl"        "data.frame"
dim(baby.names) # how many rows and columns?
## [1] 1966001       7
names(baby.names) # or colnames(baby.names)
## [1] "Location"    "Year"        "Sex"         "Name"        "Count"      
## [6] "Percent"     "Name.length"
str(baby.names) # more details
## Classes 'tbl_df', 'tbl' and 'data.frame':    1966001 obs. of  7 variables:
##  $ Location   : chr  "England and Wales" "England and Wales" "England and Wales" "England and Wales" ...
##  $ Year       : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ Sex        : chr  "F" "F" "F" "F" ...
##  $ Name       : chr  "sophie" "chloe" "jessica" "emily" ...
##  $ Count      : num  7087 6824 6711 6415 6299 ...
##  $ Percent    : num  2.39 2.31 2.27 2.17 2.13 ...
##  $ Name.length: int  6 5 7 5 6 6 9 7 3 5 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 7
##   .. ..$ Location   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Sex        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Name       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Count      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Percent    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Name.length: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
glimpse(baby.names) # details, more compactly
## Observations: 1,966,001
## Variables: 7
## $ Location    <chr> "England and Wales", "England and Wales", "England...
## $ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex         <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", ...
## $ Name        <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frame.

  • A data.frame is a list of vectors of equal length
    • Each vector in the list forms a column
    • Each column can be a different type of vector
    • Typically columns are variables and the rows are observations

A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
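These properties can be checked directly. The sketch below builds a small made-up data.frame (the names and values are purely illustrative) and inspects it with base R functions.

```r
## A made-up data.frame with four rows and three columns of different types.
people <- data.frame(name   = c("ann", "bob", "cy", "dee"),
                     age    = c(31, 45, 27, 52),
                     member = c(TRUE, FALSE, TRUE, TRUE),
                     stringsAsFactors = FALSE)

is.list(people)         # a data.frame is a list ...
## [1] TRUE
length(people)          # ... whose length is the number of columns
## [1] 3
dim(people)             # rows first, then columns
## [1] 4 3
sapply(people, typeof)  # each column has its own type
```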

Slice and filter data.frame rows

You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:

## make up some example data
(example.df <- data.frame(id  = rep(letters[1:4], each = 4),
                          t   = rep(1:4, times = 4),
                          var1 = runif(16),
                          var2 = sample(letters[1:3], 16, replace = TRUE)))
##    id t       var1 var2
## 1   a 1 0.25439062    a
## 2   a 2 0.18972348    c
## 3   a 3 0.70377912    b
## 4   a 4 0.40702740    c
## 5   b 1 0.87109466    c
## 6   b 2 0.48599201    c
## 7   b 3 0.24660803    a
## 8   b 4 0.40431428    a
## 9   c 1 0.65827318    b
## 10  c 2 0.75715090    c
## 11  c 3 0.98883031    a
## 12  c 4 0.52909527    a
## 13  d 1 0.03079849    b
## 14  d 2 0.30094221    a
## 15  d 3 0.42010827    c
## 16  d 4 0.07955623    b
## rows 2 and 4
slice(example.df, c(2, 4))
##   id t      var1 var2
## 1  a 2 0.1897235    c
## 2  a 4 0.4070274    c
## rows where id == "a"
filter(example.df, id == "a")
##   id t      var1 var2
## 1  a 1 0.2543906    a
## 2  a 2 0.1897235    c
## 3  a 3 0.7037791    b
## 4  a 4 0.4070274    c
## rows where id is either "a" or "b"
filter(example.df, id %in% c("a", "b"))
##   id t      var1 var2
## 1  a 1 0.2543906    a
## 2  a 2 0.1897235    c
## 3  a 3 0.7037791    b
## 4  a 4 0.4070274    c
## 5  b 1 0.8710947    c
## 6  b 2 0.4859920    c
## 7  b 3 0.2466080    a
## 8  b 4 0.4043143    a

Select data.frame columns

slice and filter are used to extract rows. select is used to extract columns:

select(example.df, id, var1)
##    id       var1
## 1   a 0.25439062
## 2   a 0.18972348
## 3   a 0.70377912
## 4   a 0.40702740
## 5   b 0.87109466
## 6   b 0.48599201
## 7   b 0.24660803
## 8   b 0.40431428
## 9   c 0.65827318
## 10  c 0.75715090
## 11  c 0.98883031
## 12  c 0.52909527
## 13  d 0.03079849
## 14  d 0.30094221
## 15  d 0.42010827
## 16  d 0.07955623
select(example.df, id, t, var1)
##    id t       var1
## 1   a 1 0.25439062
## 2   a 2 0.18972348
## 3   a 3 0.70377912
## 4   a 4 0.40702740
## 5   b 1 0.87109466
## 6   b 2 0.48599201
## 7   b 3 0.24660803
## 8   b 4 0.40431428
## 9   c 1 0.65827318
## 10  c 2 0.75715090
## 11  c 3 0.98883031
## 12  c 4 0.52909527
## 13  d 1 0.03079849
## 14  d 2 0.30094221
## 15  d 3 0.42010827
## 16  d 4 0.07955623

You can also conveniently select a single column using $, like this:

example.df$t
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

Data manipulation commands can be combined:

filter(select(example.df,
              id,
              var1),
       id == "a")
##   id      var1
## 1  a 0.2543906
## 2  a 0.1897235
## 3  a 0.7037791
## 4  a 0.4070274

In the previous example we used == to filter rows where id was “a”. Other relational and logical operators are listed below.

Operator    Meaning
==          equal to
!=          not equal to
>           greater than
>=          greater than or equal to
<           less than
<=          less than or equal to
%in%        contained in
&           and
|           or
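These operators return one TRUE or FALSE per element, and that is what filter uses to decide which rows to keep. The sketch below applies them to two short made-up vectors; the same expressions can be used inside filter(), for example filter(example.df, id == "a" & var1 > 0.2).

```r
## Made-up values standing in for two data.frame columns.
var1 <- c(0.25, 0.19, 0.70, 0.41)
id   <- c("a", "a", "b", "c")

id == "a" & var1 > 0.2               # both conditions must hold
## [1]  TRUE FALSE FALSE FALSE
id == "c" | var1 > 0.5               # at least one condition must hold
## [1] FALSE FALSE  TRUE  TRUE
id %in% c("a", "b") & var1 >= 0.25   # membership combined with a comparison
## [1]  TRUE FALSE  TRUE FALSE
```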

Adding, removing, and modifying data.frame columns

You can modify data.frames using the mutate() function. It works like this:

example.df
##    id t       var1 var2
## 1   a 1 0.25439062    a
## 2   a 2 0.18972348    c
## 3   a 3 0.70377912    b
## 4   a 4 0.40702740    c
## 5   b 1 0.87109466    c
## 6   b 2 0.48599201    c
## 7   b 3 0.24660803    a
## 8   b 4 0.40431428    a
## 9   c 1 0.65827318    b
## 10  c 2 0.75715090    c
## 11  c 3 0.98883031    a
## 12  c 4 0.52909527    a
## 13  d 1 0.03079849    b
## 14  d 2 0.30094221    a
## 15  d 3 0.42010827    c
## 16  d 4 0.07955623    b
## modify example.df and assign the modified data.frame the name example.df
example.df <- mutate(example.df,
       var2 = var1/t, # replace the values in var2
       var3 = 1:length(t), # create a new column named var3
       var4 = factor(letters[t]),
       t = NULL # delete the column named t
       )
## examine our changes
example.df
##    id       var1       var2 var3 var4
## 1   a 0.25439062 0.25439062    1    a
## 2   a 0.18972348 0.09486174    2    b
## 3   a 0.70377912 0.23459304    3    c
## 4   a 0.40702740 0.10175685    4    d
## 5   b 0.87109466 0.87109466    5    a
## 6   b 0.48599201 0.24299600    6    b
## 7   b 0.24660803 0.08220268    7    c
## 8   b 0.40431428 0.10107857    8    d
## 9   c 0.65827318 0.65827318    9    a
## 10  c 0.75715090 0.37857545   10    b
## 11  c 0.98883031 0.32961010   11    c
## 12  c 0.52909527 0.13227382   12    d
## 13  d 0.03079849 0.03079849   13    a
## 14  d 0.30094221 0.15047110   14    b
## 15  d 0.42010827 0.14003609   15    c
## 16  d 0.07955623 0.01988906   16    d

Exporting Data

Now that we have made some changes to our data set, we might want to save those changes to a file.

# write data to a .csv file
# (the file name is passed by position, which works across readr versions)
write_csv(example.df, "example.csv")

# write data to an R file
write_rds(example.df, "example.rds")

# write data to a Stata file
library(haven)
write_dta(example.df, path = "example.dta")
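The choice of format matters when you read the data back in. As a rough sketch with a made-up data set and temporary files, a .csv file stores only the values as text, while an .rds file stores the R object itself, including column types such as factors.

```r
library(readr)

## A made-up data set with a factor column.
toy <- data.frame(id = 1:3, grp = factor(c("lo", "hi", "lo")))

csv.file <- tempfile(fileext = ".csv")
rds.file <- tempfile(fileext = ".rds")
write_csv(toy, csv.file)
write_rds(toy, rds.file)

from.csv <- suppressMessages(read_csv(csv.file))  # hide the column specification message
from.rds <- read_rds(rds.file)

class(from.csv$grp)  # text formats store only the values
## [1] "character"
class(from.rds$grp)  # .rds stores the R object itself
## [1] "factor"
```

Use .csv when the data must be shared with other software, and .rds when it only needs to come back into R unchanged.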

Saving and loading R workspaces

In addition to importing individual datasets, R can save and load entire workspaces:

ls() # list objects in our workspace
##  [1] "a"                       "baby.names"             
##  [3] "births.by.year"          "comet"                  
##  [5] "comet.plot"              "example.df"             
##  [7] "fit"                     "name.length.by.location"
##  [9] "nwbuilding"              "orig.search.path"       
## [11] "popular.girl.names"      "w"                      
## [13] "W"                       "worldpop"               
## [15] "x"                       "X"                      
## [17] "y"                       "Y"                      
## [19] "z"                       "Z"
save.image(file="myWorkspace.RData") # save workspace 
rm(list=ls()) # remove all objects from our workspace 
ls() # list stored objects to make sure they are deleted
## character(0)

Load the “myWorkspace.RData” file and check that it is restored

load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
##  [1] "a"                       "baby.names"             
##  [3] "births.by.year"          "comet"                  
##  [5] "comet.plot"              "example.df"             
##  [7] "fit"                     "name.length.by.location"
##  [9] "nwbuilding"              "orig.search.path"       
## [11] "popular.girl.names"      "w"                      
## [13] "W"                       "worldpop"               
## [15] "x"                       "X"                      
## [17] "y"                       "Y"                      
## [19] "z"                       "Z"
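Saving the whole workspace is not always what you want. As a sketch, the base R save function stores only the objects you name; the object names and the temporary file below are made up for illustration.

```r
## Two made-up objects to save.
quiz.scores <- c(90, 85, 77)
quiz.groups <- c("a", "b", "c")

rdata.file <- tempfile(fileext = ".RData")
save(quiz.scores, quiz.groups, file = rdata.file)  # save only these two objects

rm(quiz.scores, quiz.groups)  # remove them from the workspace
exists("quiz.scores")
## [1] FALSE

restored <- load(rdata.file)  # load() returns the names of the objects it restored
restored
quiz.scores
## [1] 90 85 77
```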

Exercise 3: Data manipulation

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names.

  1. Filter baby.names to show only names given to at least 5 percent of boys.
  2. Create a column named “Proportion” equal to Percent divided by 100.
  3. Filter baby.names to include only names given to at least 3 percent of girls. Save this to a Stata data set named “popularGirlNames.dta”.

Exercise 3 solution

filter(baby.names, Sex == "M" & Percent >= 5)
## # A tibble: 0 × 7
## # ... with 7 variables: Location <chr>, Year <int>, Sex <chr>, Name <chr>,
## #   Count <dbl>, Percent <dbl>, Name.length <int>
baby.names <- mutate(baby.names, Proportion = Percent/100)

popular.girl.names <- filter(baby.names, Sex == "F" & Percent >= 3)

## write_dta comes from the haven package, which writes Stata files
library(haven)
write_dta(popular.girl.names, path = "popularGirlNames.dta")

Basic Statistics and Graphs

Basic statistics

Descriptive statistics of single variables are straightforward:

sum(example.df$var1) # calculate sum of var 1
## [1] 7.327684
mean(example.df$var1)
## [1] 0.4579803
median(example.df$var1)
## [1] 0.4135678
sd(example.df$var1) # calculate standard deviation of var1
## [1] 0.2785311
var(example.df$var1)
## [1] 0.07757959
## summaries of individual columns
summary(example.df$var1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0308  0.2524  0.4136  0.4580  0.6696  0.9888
summary(example.df$var2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01989 0.09952 0.14530 0.23890 0.27320 0.87110
## summary of whole data.frame
summary(example.df)
##  id         var1             var2              var3       var4 
##  a:4   Min.   :0.0308   Min.   :0.01989   Min.   : 1.00   a:4  
##  b:4   1st Qu.:0.2524   1st Qu.:0.09952   1st Qu.: 4.75   b:4  
##  c:4   Median :0.4136   Median :0.14525   Median : 8.50   c:4  
##  d:4   Mean   :0.4580   Mean   :0.23893   Mean   : 8.50   d:4  
##        3rd Qu.:0.6696   3rd Qu.:0.27320   3rd Qu.:12.25        
##        Max.   :0.9888   Max.   :0.87109   Max.   :16.00

Some of these functions (e.g., summary) will also work with data.frames and other types of objects; others (such as sd) will not.

Statistics by grouping variable(s)

The summarize function can be used to calculate statistics by grouping variable. Here is how it works.

summarize(group_by(example.df, id), mean(var1), sd(var1))
## # A tibble: 4 × 3
##       id `mean(var1)` `sd(var1)`
##   <fctr>        <dbl>      <dbl>
## 1      a    0.3887302  0.2289406
## 2      b    0.5020022  0.2653643
## 3      c    0.7333374  0.1942449
## 4      d    0.2078513  0.1839622

You can group by multiple variables:

summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
## Source: local data frame [16 x 4]
## Groups: id [?]
## 
##        id  var3 `mean(var1)` `sd(var1)`
##    <fctr> <int>        <dbl>      <dbl>
## 1       a     1   0.25439062         NA
## 2       a     2   0.18972348         NA
## 3       a     3   0.70377912         NA
## 4       a     4   0.40702740         NA
## 5       b     5   0.87109466         NA
## 6       b     6   0.48599201         NA
## 7       b     7   0.24660803         NA
## 8       b     8   0.40431428         NA
## 9       c     9   0.65827318         NA
## 10      c    10   0.75715090         NA
## 11      c    11   0.98883031         NA
## 12      c    12   0.52909527         NA
## 13      d    13   0.03079849         NA
## 14      d    14   0.30094221         NA
## 15      d    15   0.42010827         NA
## 16      d    16   0.07955623         NA
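The NA values in the sd(var1) column above arise because every id and var3 combination contains a single row, and the standard deviation of one value is undefined. The sketch below shows this with a small made-up data set, and also names the summary columns and counts the rows per group with dplyr's n().

```r
library(dplyr)

## Made-up data: group "a" has two rows, group "b" has one.
toy <- data.frame(id = c("a", "a", "b"), var1 = c(1, 3, 10))

by.id <- summarize(group_by(toy, id),
                   n       = n(),        # number of rows in each group
                   mean.v1 = mean(var1),
                   sd.v1   = sd(var1))   # NA when a group has a single row
by.id

sd(10)  # the standard deviation of a single value is undefined
## [1] NA
```

Including n() in a grouped summary is a cheap way to spot groups that are too small for the statistic you are computing.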

Save R output to a file

Earlier we learned how to write a data set to a file. But what if we want to write something that isn’t in a nice rectangular format, like the output of summary? For that we can use the sink() function:

sink(file="output.txt", split=TRUE) # start logging
print("This is the summary of example.df \n")
## [1] "This is the summary of example.df \n"
print(summary(example.df))
##  id         var1             var2              var3       var4 
##  a:4   Min.   :0.0308   Min.   :0.01989   Min.   : 1.00   a:4  
##  b:4   1st Qu.:0.2524   1st Qu.:0.09952   1st Qu.: 4.75   b:4  
##  c:4   Median :0.4136   Median :0.14525   Median : 8.50   c:4  
##  d:4   Mean   :0.4580   Mean   :0.23893   Mean   : 8.50   d:4  
##        3rd Qu.:0.6696   3rd Qu.:0.27320   3rd Qu.:12.25        
##        Max.   :0.9888   Max.   :0.87109   Max.   :16.00
sink() ## sink with no arguments turns logging off
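Notice that print shows the \n literally in the output above. A variation worth sketching uses cat, which interprets \n as a line break, and capture.output, which collects printed output without redirecting the console. The temporary file and the built-in cars data set are used here only so the example runs anywhere.

```r
log.file <- tempfile(fileext = ".txt")

sink(log.file)                               # start sending output to the file
cat("Summary of the built-in cars data\n")   # cat() interprets \n as a line break
print(summary(cars))
sink()                                       # stop; output goes back to the console

logged <- readLines(log.file)
logged[1]
## [1] "Summary of the built-in cars data"

## capture.output() collects printed output as a character vector without using sink().
captured <- capture.output(summary(cars))
length(captured) > 0
## [1] TRUE
```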

Exercise 4

  1. Calculate the total number of children born.
  2. Filter the data to extract only Massachusetts (Location “MA”), and calculate the total number of children born in Massachusetts.
  3. Group and summarize the data to calculate the number of children born each year. Assign the result to the name births.by.year.
  4. Calculate the average number of characters in baby names (using the “Name.length” column).
  5. Group and summarize to calculate the average number of characters in baby names for each location. Assign the result to the name name.length.by.location.

Exercise 4 solution

sum(baby.names$Count)
## [1] 76865321
sum(filter(baby.names, Location == "MA")$Count)
## [1] 1232841
births.by.year <- summarize(group_by(baby.names, Year), sum(Count))

mean(baby.names$Name.length)
## [1] 5.978752
name.length.by.location <- summarize(group_by(baby.names, Location), mean(Name.length))

Basic graphics: Frequency bars

Thanks to classes and methods, you can plot() many kinds of objects:

plot(example.df$var4)

Basic graphics: Boxplots by group

Plotting a factor column together with a numeric column produces boxplots by group:

plot(select(example.df, id, var1))

Basic graphics: Mosaic chart

Plotting two factor columns against each other produces a mosaic-style chart:

plot(select(example.df, id, var4))

Basic graphics: scatter plot

plot(select(example.df, var1, var2))
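Which kind of plot you get depends on the classes of the objects passed to plot(). The sketch below repeats the four plots with made-up data and sends them to a temporary PDF file so that it also runs without a display. In R 4.0 and later data.frame() no longer converts strings to factors, so the grouping variables are created with factor() explicitly.

```r
set.seed(1)  # made-up data; the seed only makes the example repeatable
grp   <- factor(rep(c("a", "b", "c"), each = 10))
size  <- factor(sample(c("small", "large"), 30, replace = TRUE))
score <- rnorm(30)

plot.file <- tempfile(fileext = ".pdf")
pdf(plot.file)          # send plots to a file instead of the screen
plot(grp)               # one factor: frequency bars
plot(grp, score)        # factor and numeric: boxplots by group
plot(grp, size)         # two factors: spine plot (a mosaic-style chart)
plot(score, rnorm(30))  # two numerics: scatter plot
invisible(dev.off())    # close the file

file.exists(plot.file)
## [1] TRUE
```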

Exercise 5 TBD

Wrap-up

Help us make this workshop better!

Additional resources